Data dictionary¶

  • Variable Name = Description
  • age = Age
  • job = Type of job
  • marital = Marital status
  • education = Level of education
  • default = Has credit in default
  • housing = Has a housing loan
  • loan = Has a personal loan
  • contact = Contact communication type
  • day_of_week = Day of contact
  • month = Month of contact
  • duration = Last contact duration, in seconds (numeric). Important note: this attribute strongly affects the output target (e.g., if duration = 0 then y = "no"), yet the duration is not known before a call is performed, and after the call ends y is obviously known.
  • campaign = Number of contacts made with this client during the current campaign
  • pdays = Number of days since the client was last contacted in a previous campaign (numeric; 999 means the client was not previously contacted)
  • previous = Number of contacts made with this client before the current campaign
  • poutcome = Outcome of the previous marketing campaign
  • emp.var.rate = Employment variation rate - quarterly indicator (numeric)
  • cons.price.idx = Consumer price index - monthly indicator (numeric)
  • cons.conf.idx = Consumer confidence index - monthly indicator (numeric)
  • euribor3m = Euribor 3-month rate - daily indicator (numeric)
  • nr.employed = Number of employees - quarterly indicator (numeric)
  • y = Did the client subscribe to a Telecom plan?
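The dictionary above can also live alongside the analysis as a plain Python mapping, e.g. to check that every expected column is present after loading. A minimal sketch (the `DATA_DICTIONARY` and `check_columns` names are illustrative, not part of the original notebook):

```python
# Hypothetical mapping from column name to description, copied from the data dictionary above
DATA_DICTIONARY = {
    "age": "Age",
    "job": "Type of job",
    "marital": "Marital status",
    "education": "Level of education",
    "default": "Has credit in default",
    "housing": "Has a housing loan",
    "loan": "Has a personal loan",
    "contact": "Contact communication type",
    "day_of_week": "Day of contact",
    "month": "Month of contact",
    "duration": "Last contact duration, in seconds",
    "campaign": "Number of contacts during this campaign",
    "pdays": "Days since last contact in a previous campaign",
    "previous": "Number of contacts before this campaign",
    "poutcome": "Outcome of the previous marketing campaign",
    "emp.var.rate": "Employment variation rate",
    "cons.price.idx": "Consumer price index",
    "cons.conf.idx": "Consumer confidence index",
    "euribor3m": "Euribor 3-month rate",
    "nr.employed": "Number of employees",
    "y": "Did the client subscribe to a Telecom plan?",
}

def check_columns(columns, dictionary=DATA_DICTIONARY):
    """Return the expected columns that are missing from `columns`, sorted."""
    return sorted(set(dictionary) - set(columns))
```

This gives a quick sanity check after any parsing step that renames or splits columns.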

Exploratory Data Analysis¶

  • Step 1: Import data

  • Step 2: Collect initial data

  • Step 3: Describe data

  • Step 4: Verify data quality

  • Step 5: Explore data

    • Step 5.1: Summary Statistics
    • Step 5.2: Identifying Outliers
    • Step 5.3: Relationship Between Variables

(Hotz, 2023)

Step 1 - Import data¶

In [94]:
# Import libraries for reading and cleaning the .csv file
import pandas as pd # Data analysis and manipulation library
import re # Module used for string searching and manipulation
import io # Module used for reading and writing data

# (GeeksforGeeks, 2022)
In [95]:
# Read the CSV file
with open('TeleCom_Data-1.csv', 'r') as file:
    csv_text = file.read()

# Define the separator
separator = ';'

# Use regular expressions to strip the stray quote characters around the separators.
# Handle the longest pattern ('";"') first so the shorter patterns don't leave stray quotes behind.
csv_text = re.sub(r'";"', separator, csv_text)
csv_text = re.sub(r'";', separator, csv_text)
csv_text = re.sub(r';"', separator, csv_text)

# Convert the csv into a dataframe using pandas
df = pd.read_csv(io.StringIO(csv_text), sep=separator)

# Display the first few rows of the dataframe
df.head()

#(Grabbing CSV Information With Regex in Python, n.d.) 
Out[95]:
age;job marital education default housing loan contact month day_of_week duration campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y"""
0 40;admin. married basic.6y no no no telephone may mon" 151 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no"
1 56;services married high.school no no yes telephone may mon" 307 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no"
2 45;services married basic.9y unknown no no telephone may mon" 198 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no"
3 59;admin. married professional.course no no no telephone may mon" 139 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no"
4 41;blue-collar married unknown unknown no no telephone may mon" 217 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no"
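The output above shows that some quotes survive the substitutions (the merged `age;job` header and the trailing `y"""`), which the next cells patch up one by one. An alternative worth considering, assuming the quotes carry no meaning beyond delimiting fields, is to strip every double quote in one pass before parsing. A sketch on a small illustrative sample:

```python
import io
import pandas as pd

# Illustrative sample mimicking the quoting style of the raw file
csv_text = '"age";"job";"y"\n40;"admin.";"no"\n56;"services";"no"\n'

# Remove every double quote, then let pandas split on the semicolon separator
clean = csv_text.replace('"', '')
df = pd.read_csv(io.StringIO(clean), sep=';')
```

With this approach no column ends up merged or quote-suffixed, so the later split/rename cells would not be needed.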
In [96]:
# Split the 'age;job' column into two separate columns: 'age' and 'job'
df[['age', 'job']] = df["age;job"].str.split(';', expand=True)

# Drop the original 'age;job' column
df.drop(columns='age;job', inplace=True)

df.head()
Out[96]:
marital education default housing loan contact month day_of_week duration campaign ... previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y""" age job
0 married basic.6y no no no telephone may mon" 151 1 ... 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no" 40 admin.
1 married high.school no no yes telephone may mon" 307 1 ... 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no" 56 services
2 married basic.9y unknown no no telephone may mon" 198 1 ... 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no" 45 services
3 married professional.course no no no telephone may mon" 139 1 ... 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no" 59 admin.
4 married unknown unknown no no telephone may mon" 217 1 ... 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no" 41 blue-collar

5 rows × 21 columns

In [97]:
# Remove stray quote characters from the values in the affected columns
df['y"""'] = df['y"""'].str.replace('"', "")
df['day_of_week'] = df['day_of_week'].str.replace('"', "")
df.head()
Out[97]:
marital education default housing loan contact month day_of_week duration campaign ... previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y""" age job
0 married basic.6y no no no telephone may mon 151 1 ... 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no 40 admin.
1 married high.school no no yes telephone may mon 307 1 ... 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no 56 services
2 married basic.9y unknown no no telephone may mon 198 1 ... 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no 45 services
3 married professional.course no no no telephone may mon 139 1 ... 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no 59 admin.
4 married unknown unknown no no telephone may mon 217 1 ... 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no 41 blue-collar

5 rows × 21 columns

In [98]:
# Rename the target variable column to remove the stray quote characters
df.rename(columns={'y"""': 'y'}, inplace=True)
df.head()
Out[98]:
marital education default housing loan contact month day_of_week duration campaign ... previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y age job
0 married basic.6y no no no telephone may mon 151 1 ... 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no 40 admin.
1 married high.school no no yes telephone may mon 307 1 ... 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no 56 services
2 married basic.9y unknown no no telephone may mon 198 1 ... 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no 45 services
3 married professional.course no no no telephone may mon 139 1 ... 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no 59 admin.
4 married unknown unknown no no telephone may mon 217 1 ... 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no 41 blue-collar

5 rows × 21 columns

Step 2 - Collect initial data¶

In [99]:
# Display the shape and first 5 rows of the data
print(df.shape)
df.head()
(41180, 21)
Out[99]:
marital education default housing loan contact month day_of_week duration campaign ... previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y age job
0 married basic.6y no no no telephone may mon 151 1 ... 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no 40 admin.
1 married high.school no no yes telephone may mon 307 1 ... 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no 56 services
2 married basic.9y unknown no no telephone may mon 198 1 ... 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no 45 services
3 married professional.course no no no telephone may mon 139 1 ... 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no 59 admin.
4 married unknown unknown no no telephone may mon 217 1 ... 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no 41 blue-collar

5 rows × 21 columns

In [100]:
# Look at 5 random rows of the data
df.sample(5)
Out[100]:
marital education default housing loan contact month day_of_week duration campaign ... previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y age job
12604 married professional.course no no no cellular jul mon 167 2 ... 0 nonexistent 1.4 93.918 -42.7 4.960 5228.1 no 32 technician
32038 married basic.9y no no no cellular may thu 369 1 ... 0 nonexistent -1.8 92.893 -46.2 1.327 5099.1 no 56 blue-collar
17570 single university.degree unknown no no cellular jul mon 644 3 ... 0 nonexistent 1.4 93.918 -42.7 4.962 5228.1 no 39 technician
22989 married university.degree no no no cellular aug mon 98 3 ... 0 nonexistent 1.4 93.444 -36.1 4.965 5228.1 no 50 admin.
39122 married high.school no yes no cellular dec wed 158 3 ... 0 nonexistent -3.0 92.713 -33.0 0.700 5023.5 no 31 student

5 rows × 21 columns

Step 3 - Describe data¶

In [101]:
# Check the number of rows and columns, the non-null counts, and the data type of each column
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41180 entries, 0 to 41179
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   marital         41180 non-null  object 
 1   education       41180 non-null  object 
 2   default         41180 non-null  object 
 3   housing         41180 non-null  object 
 4   loan            41180 non-null  object 
 5   contact         41180 non-null  object 
 6   month           41180 non-null  object 
 7   day_of_week     41180 non-null  object 
 8   duration        41180 non-null  int64  
 9   campaign        41180 non-null  int64  
 10  pdays           41180 non-null  int64  
 11  previous        41180 non-null  int64  
 12  poutcome        41180 non-null  object 
 13  emp.var.rate    41180 non-null  float64
 14  cons.price.idx  41180 non-null  float64
 15  cons.conf.idx   41180 non-null  float64
 16  euribor3m       41180 non-null  float64
 17  nr.employed     41180 non-null  float64
 18  y               41180 non-null  object 
 19  age             41180 non-null  object 
 20  job             41180 non-null  object 
dtypes: float64(5), int64(4), object(12)
memory usage: 6.6+ MB
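Note that `df.info()` reports `age` as `object`: the `str.split` in Step 1 produced strings, not numbers, so `age` is excluded from `describe()` below. Converting it to a numeric dtype is a sensible next step. A minimal sketch on a stand-in frame (the sample values are illustrative):

```python
import pandas as pd

# Stands in for the 'age' column produced by the str.split step
df = pd.DataFrame({"age": ["40", "56", "45"]})

# Coerce the string values to integers so describe() and plots treat them numerically;
# errors="coerce" turns any non-numeric leftovers into <NA> rather than raising
df["age"] = pd.to_numeric(df["age"], errors="coerce").astype("Int64")
```

The nullable `Int64` dtype keeps the column integer-typed even if coercion introduces missing values.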
In [102]:
# Get a quick statistical overview of the data
df.describe()
Out[102]:
duration campaign pdays previous emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
count 41180.000000 41180.000000 41180.000000 41180.000000 41180.000000 41180.000000 41180.000000 41180.000000 41180.000000
mean 258.280427 2.567800 962.516707 0.172705 0.081901 93.575508 -40.501999 3.621422 5167.053344
std 259.299856 2.770225 186.809028 0.493719 1.571037 0.578762 4.627358 1.734385 72.230334
min 0.000000 1.000000 0.000000 0.000000 -3.400000 92.201000 -50.800000 0.634000 4963.600000
25% 102.000000 1.000000 999.000000 0.000000 -1.800000 93.075000 -42.700000 1.344000 5099.100000
50% 180.000000 2.000000 999.000000 0.000000 1.100000 93.749000 -41.800000 4.857000 5191.000000
75% 319.000000 3.000000 999.000000 0.000000 1.400000 93.994000 -36.400000 4.961000 5228.100000
max 4918.000000 56.000000 999.000000 7.000000 1.400000 94.767000 -26.900000 5.045000 5228.100000
In [103]:
# Check distribution of target variable
df['y'].value_counts()
Out[103]:
no     36542
yes     4638
Name: y, dtype: int64
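The raw counts above hide the severity of the imbalance; normalising them makes it explicit. A sketch reconstructing the split from the counts shown (roughly 89% "no" vs 11% "yes"):

```python
import pandas as pd

# Rebuild the target distribution from the counts printed above
y = pd.Series(["no"] * 36542 + ["yes"] * 4638, name="y")

# Share of each class; the ~89/11 split is a clear imbalance worth noting before any modelling
proportions = y.value_counts(normalize=True).round(3)
```

An imbalance this strong usually argues for stratified train/test splits and metrics beyond plain accuracy.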

Step 4 - Verify data quality¶

In [104]:
# Check for missing values to be sure
df.isnull().sum()
Out[104]:
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
emp.var.rate      0
cons.price.idx    0
cons.conf.idx     0
euribor3m         0
nr.employed       0
y                 0
age               0
job               0
dtype: int64
In [105]:
# Check for duplicates
df.duplicated().sum()
Out[105]:
12
In [106]:
# Group the duplicate rows with their original counterparts for comparison
duplicated_groups = df.groupby(list(df.columns)).filter(lambda x: len(x) > 1)
print(duplicated_groups)
# (So, 2023)
        marital            education  default housing loan    contact month  \
1261    married             basic.6y       no      no   no  telephone   may   
1262    married             basic.6y       no      no   no  telephone   may   
12256   married              unknown       no      no   no  telephone   jul   
12257   married              unknown       no      no   no  telephone   jul   
14151    single  professional.course       no      no   no   cellular   jul   
14230    single  professional.course       no      no   no   cellular   jul   
16815  divorced          high.school       no     yes   no   cellular   jul   
16952  divorced          high.school       no     yes   no   cellular   jul   
18460    single  professional.course       no     yes   no   cellular   jul   
18461    single  professional.course       no     yes   no   cellular   jul   
20068   married          high.school  unknown      no   no   cellular   aug   
20212   married          high.school  unknown      no   no   cellular   aug   
20527   married  professional.course       no     yes   no   cellular   aug   
20530   married  professional.course       no     yes   no   cellular   aug   
25179   married    university.degree       no      no   no   cellular   nov   
25213   married    university.degree       no      no   no   cellular   nov   
28472    single          high.school       no     yes   no   cellular   apr   
28473    single          high.school       no     yes   no   cellular   apr   
32501   married    university.degree       no     yes   no   cellular   may   
32512   married    university.degree       no     yes   no   cellular   may   
36946   married    university.degree       no      no   no   cellular   jul   
36947   married    university.degree       no      no   no   cellular   jul   
38251    single    university.degree       no      no   no  telephone   oct   
38277    single    university.degree       no      no   no  telephone   oct   

      day_of_week  duration  campaign  ...  previous     poutcome  \
1261          thu       124         1  ...         0  nonexistent   
1262          thu       124         1  ...         0  nonexistent   
12256         thu        88         1  ...         0  nonexistent   
12257         thu        88         1  ...         0  nonexistent   
14151         mon       331         2  ...         0  nonexistent   
14230         mon       331         2  ...         0  nonexistent   
16815         thu        43         3  ...         0  nonexistent   
16952         thu        43         3  ...         0  nonexistent   
18460         thu       128         1  ...         0  nonexistent   
18461         thu       128         1  ...         0  nonexistent   
20068         mon        33         1  ...         0  nonexistent   
20212         mon        33         1  ...         0  nonexistent   
20527         tue       127         1  ...         0  nonexistent   
20530         tue       127         1  ...         0  nonexistent   
25179         tue       123         2  ...         0  nonexistent   
25213         tue       123         2  ...         0  nonexistent   
28472         tue       114         1  ...         0  nonexistent   
28473         tue       114         1  ...         0  nonexistent   
32501         fri       348         4  ...         0  nonexistent   
32512         fri       348         4  ...         0  nonexistent   
36946         thu       252         1  ...         0  nonexistent   
36947         thu       252         1  ...         0  nonexistent   
38251         tue       120         1  ...         0  nonexistent   
38277         tue       120         1  ...         0  nonexistent   

      emp.var.rate  cons.price.idx  cons.conf.idx  euribor3m  nr.employed  \
1261           1.1          93.994          -36.4      4.855       5191.0   
1262           1.1          93.994          -36.4      4.855       5191.0   
12256          1.4          93.918          -42.7      4.966       5228.1   
12257          1.4          93.918          -42.7      4.966       5228.1   
14151          1.4          93.918          -42.7      4.962       5228.1   
14230          1.4          93.918          -42.7      4.962       5228.1   
16815          1.4          93.918          -42.7      4.962       5228.1   
16952          1.4          93.918          -42.7      4.962       5228.1   
18460          1.4          93.918          -42.7      4.968       5228.1   
18461          1.4          93.918          -42.7      4.968       5228.1   
20068          1.4          93.444          -36.1      4.965       5228.1   
20212          1.4          93.444          -36.1      4.965       5228.1   
20527          1.4          93.444          -36.1      4.966       5228.1   
20530          1.4          93.444          -36.1      4.966       5228.1   
25179         -0.1          93.200          -42.0      4.153       5195.8   
25213         -0.1          93.200          -42.0      4.153       5195.8   
28472         -1.8          93.075          -47.1      1.423       5099.1   
28473         -1.8          93.075          -47.1      1.423       5099.1   
32501         -1.8          92.893          -46.2      1.313       5099.1   
32512         -1.8          92.893          -46.2      1.313       5099.1   
36946         -2.9          92.469          -33.6      1.072       5076.2   
36947         -2.9          92.469          -33.6      1.072       5076.2   
38251         -3.4          92.431          -26.9      0.742       5017.5   
38277         -3.4          92.431          -26.9      0.742       5017.5   

         y age          job  
1261    no  39  blue-collar  
1262    no  39  blue-collar  
12256   no  36      retired  
12257   no  36      retired  
14151   no  27   technician  
14230   no  27   technician  
16815   no  47   technician  
16952   no  47   technician  
18460   no  32   technician  
18461   no  32   technician  
20068   no  55     services  
20212   no  55     services  
20527   no  41   technician  
20530   no  41   technician  
25179   no  39       admin.  
25213   no  39       admin.  
28472   no  24     services  
28473   no  24     services  
32501   no  35       admin.  
32512   no  35       admin.  
36946  yes  45       admin.  
36947  yes  45       admin.  
38251   no  71      retired  
38277   no  71      retired  

[24 rows x 21 columns]
In [107]:
# Remove the duplicate rows while keeping the first occurrence
df = df.drop_duplicates()
In [108]:
# Verify duplicates are removed
print(df.duplicated().sum())  # Should return 0 now
0

Step 5 - Explore data¶

Step 5.1 - Summary statistics¶

In [109]:
# General Summary
summary_df = df.describe(include='all').transpose()

# Display Summary
print(summary_df)

#(So, 2023)
                  count unique                top   freq         mean  \
marital           41168      4            married  24914          NaN   
education         41168      8  university.degree  12162          NaN   
default           41168      3                 no  32570          NaN   
housing           41168      3                yes  21566          NaN   
loan              41168      3                 no  33931          NaN   
contact           41168      2           cellular  26131          NaN   
month             41168     10                may  13763          NaN   
day_of_week       41168      5                thu   8617          NaN   
duration        41168.0    NaN                NaN    NaN   258.311237   
campaign        41168.0    NaN                NaN    NaN     2.568087   
pdays           41168.0    NaN                NaN    NaN   962.506073   
previous        41168.0    NaN                NaN    NaN     0.172756   
poutcome          41168      3        nonexistent  35547          NaN   
emp.var.rate    41168.0    NaN                NaN    NaN     0.081937   
cons.price.idx  41168.0    NaN                NaN    NaN    93.575563   
cons.conf.idx   41168.0    NaN                NaN    NaN   -40.502261   
euribor3m       41168.0    NaN                NaN    NaN     3.621425   
nr.employed     41168.0    NaN                NaN    NaN  5167.052308   
y                 41168      2                 no  36531          NaN   
age               41168     78                 31   1947          NaN   
job               41168     12             admin.  10419          NaN   

                       std     min     25%     50%     75%     max  
marital                NaN     NaN     NaN     NaN     NaN     NaN  
education              NaN     NaN     NaN     NaN     NaN     NaN  
default                NaN     NaN     NaN     NaN     NaN     NaN  
housing                NaN     NaN     NaN     NaN     NaN     NaN  
loan                   NaN     NaN     NaN     NaN     NaN     NaN  
contact                NaN     NaN     NaN     NaN     NaN     NaN  
month                  NaN     NaN     NaN     NaN     NaN     NaN  
day_of_week            NaN     NaN     NaN     NaN     NaN     NaN  
duration        259.325938     0.0   102.0   180.0   319.0  4918.0  
campaign           2.77053     1.0     1.0     2.0     3.0    56.0  
pdays           186.835214     0.0   999.0   999.0   999.0   999.0  
previous          0.493782     0.0     0.0     0.0     0.0     7.0  
poutcome               NaN     NaN     NaN     NaN     NaN     NaN  
emp.var.rate       1.57096    -3.4    -1.8     1.1     1.4     1.4  
cons.price.idx    0.578761  92.201  93.075  93.749  93.994  94.767  
cons.conf.idx      4.62702   -50.8   -42.7   -41.8   -36.4   -26.9  
euribor3m         1.734375   0.634   1.344   4.857   4.961   5.045  
nr.employed      72.230165  4963.6  5099.1  5191.0  5228.1  5228.1  
y                      NaN     NaN     NaN     NaN     NaN     NaN  
age                    NaN     NaN     NaN     NaN     NaN     NaN  
job                    NaN     NaN     NaN     NaN     NaN     NaN  
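The quantiles above show that `pdays` is 999 for the vast majority of rows, so its mean and standard deviation are dominated by the sentinel rather than by real day counts. One common option (an assumption on my part, not a step the notebook takes) is to derive a boolean flag and analyse the genuine day counts separately:

```python
import pandas as pd

# Illustrative values; in the real data most entries are the 999 sentinel
pdays = pd.Series([999, 999, 3, 999, 6])

# 999 is a sentinel for "never contacted before"; a flag separates it from real day counts
previously_contacted = pdays.ne(999)
share_sentinel = pdays.eq(999).mean()
```

Summary statistics computed on `pdays[previously_contacted]` then describe actual recontact gaps instead of the sentinel.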
In [110]:
import dataprep  # Library used to simplify data preparation
from dataprep.eda import create_report # Function from dataprep used for exploratory data analysis
# Data Profiling using dataprep
report = create_report(df, title="Telecom Data EDA")
display(report)

#(Dong, 2023)
#(Dataprep.eda Inside Google Colab, n.d.) 
C:\Users\ans_b\anaconda3\lib\site-packages\dask\core.py:119: RuntimeWarning: invalid value encountered in divide
  return func(*(_execute_task(a, cache) for a in args))
C:\Users\ans_b\anaconda3\lib\site-packages\dataprep\eda\distribution\render.py:274: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  df = df.append(pd.DataFrame({col: [nrows - npresent]}, index=["Others"]))
Telecom Data EDA

Overview

Dataset Statistics

Number of Variables 21
Number of Rows 41168
Missing Cells 0
Missing Cells (%) 0.0%
Duplicate Rows 0
Duplicate Rows (%) 0.0%
Total Size in Memory 32.6 MB
Average Row Size in Memory 829.4 B
Variable Types
  • Categorical: 13
  • Numerical: 8

Dataset Insights

duration is skewed Skewed
campaign is skewed Skewed
pdays is skewed Skewed
emp.var.rate is skewed Skewed
cons.price.idx is skewed Skewed
cons.conf.idx is skewed Skewed
euribor3m is skewed Skewed
nr.employed is skewed Skewed
age has a high cardinality: 78 distinct values High Cardinality
month has constant length 3 Constant Length
day_of_week has constant length 3 Constant Length
previous has constant length 1 Constant Length
age has constant length 2 Constant Length
emp.var.rate has 17182 (41.74%) negatives Negatives
cons.conf.idx has 41168 (100.0%) negatives Negatives

Variables


marital

categorical

Approximate Distinct Count 4
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 2.8 MB
  • The largest value (married) is over 2.15 times larger than the second largest value (single)

Length

Mean 6.8311
Standard Deviation 0.6036
Median 7
Minimum 6
Maximum 8

Sample

1st row married
2nd row married
3rd row married
4th row married
5th row married

Letter

Count 281222
Lowercase Letter 281222
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 0
  • The top 2 categories (married, single) take over 50.0%
  • The largest value (married) is over 2.15 times larger than the second largest value (single)

education

categorical

Approximate Distinct Count 8
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 3.1 MB

Length

Mean 12.7103
Standard Deviation 4.3888
Median 11
Minimum 7
Maximum 19

Sample

1st row basic.6y
2nd row high.school
3rd row basic.9y
4th row professional.cours...
5th row unknown

Letter

Count 471326
Lowercase Letter 471326
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 12510
  • The top 2 categories (university.degree, high.school) take over 50.0%

default

categorical

Approximate Distinct Count 3
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 2.7 MB
  • The largest value (no) is over 3.79 times larger than the second largest value (unknown)

Length

Mean 3.044
Standard Deviation 2.0322
Median 2
Minimum 2
Maximum 7

Sample

1st row no
2nd row no
3rd row unknown
4th row no
5th row unknown

Letter

Count 125314
Lowercase Letter 125314
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 0
  • The top 2 categories (no, unknown) take over 50.0%
  • The largest value (unknown) is over 2865.0 times larger than the second largest value (yes)

housing

categorical

Approximate Distinct Count 3
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 2.7 MB

Length

Mean 2.6441
Standard Deviation 0.8427
Median 3
Minimum 2
Maximum 7

Sample

1st row no
2nd row no
3rd row no
4th row no
5th row no

Letter

Count 108852
Lowercase Letter 108852
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 0
  • The top 2 categories (yes, no) take over 50.0%
  • The largest value (yes) is over 21.78 times larger than the second largest value (unknown)

loan

categorical

Approximate Distinct Count 3
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 2.6 MB
  • The largest value (no) is over 5.43 times larger than the second largest value (yes)

Length

Mean 2.272
Standard Deviation 0.824
Median 2
Minimum 2
Maximum 7

Sample

1st row no
2nd row yes
3rd row no
4th row no
5th row no

Letter

Count 93533
Lowercase Letter 93533
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 0
  • The top 2 categories (no, yes) take over 50.0%
  • The largest value (yes) is over 6.31 times larger than the second largest value (unknown)

contact

categorical

Approximate Distinct Count 2
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 2.9 MB
  • The largest value (cellular) is over 1.74 times larger than the second largest value (telephone)

Length

Mean 8.3653
Standard Deviation 0.4815
Median 8
Minimum 8
Maximum 9

Sample

1st row telephone
2nd row telephone
3rd row telephone
4th row telephone
5th row telephone

Letter

Count 344381
Lowercase Letter 344381
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 0
  • The top 2 categories (cellular, telephone) take over 50.0%
  • The largest value (cellular) is over 1.74 times larger than the second largest value (telephone)

month

categorical

Approximate Distinct Count 10
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 2.7 MB
  • The largest value (may) is over 1.92 times larger than the second largest value (jul)

Length

Mean 3
Standard Deviation 0
Median 3
Minimum 3
Maximum 3

Sample

1st row may
2nd row may
3rd row may
4th row may
5th row may

Letter

Count 123504
Lowercase Letter 123504
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 0
  • The top 2 categories (may, jul) take over 50.0%
  • The largest value (may) is over 1.92 times larger than the second largest value (jul)
  • month has words of constant length

day_of_week

categorical

Approximate Distinct Count 5
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 2.7 MB

Length

Mean 3
Standard Deviation 0
Median 3
Minimum 3
Maximum 3

Sample

1st row mon
2nd row mon
3rd row mon
4th row mon
5th row mon

Letter

Count 123504
Lowercase Letter 123504
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 0
  • day_of_week has words of constant length

duration

numerical

Approximate Distinct Count 1544
Approximate Unique (%) 3.8%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 643.2 KB
Mean 258.3112
Minimum 0
Maximum 4918
Zeros 4
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • duration is skewed right (γ1 = 3.2626)

Quantile Statistics

Minimum 0
5-th Percentile 36
Q1 102
Median 180
Q3 319
95-th Percentile 753
Maximum 4918
Range 4918
IQR 217

Descriptive Statistics

Mean 258.3112
Standard Deviation 259.3259
Variance 67249.9421
Sum 1.0634e+07
Skewness 3.2626
Kurtosis 20.2385
Coefficient of Variation 1.0039
  • duration is not normally distributed (p-value 8.279478273284622e-15)
  • duration has 2963 outliers

campaign

numerical

Approximate Distinct Count 42
Approximate Unique (%) 0.1%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 643.2 KB
Mean 2.5681
Minimum 1
Maximum 56
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • campaign is skewed right (γ1 = 4.7615)

Quantile Statistics

Minimum 1
5-th Percentile 1
Q1 1
Median 2
Q3 3
95-th Percentile 7
Maximum 56
Range 55
IQR 2

Descriptive Statistics

Mean 2.5681
Standard Deviation 2.7705
Variance 7.6758
Sum 105723
Skewness 4.7615
Kurtosis 36.9613
Coefficient of Variation 1.0788
  • campaign is not normally distributed (p-value 4.483245990435291e-24)
  • campaign has 2406 outliers

pdays

numerical

Approximate Distinct Count 27
Approximate Unique (%) 0.1%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 643.2 KB
Mean 962.5061
Minimum 0
Maximum 999
Zeros 15
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • pdays is skewed left (γ1 = -4.9243)

Quantile Statistics

Minimum 0
5-th Percentile 999
Q1 999
Median 999
Q3 999
95-th Percentile 999
Maximum 999
Range 999
IQR 0

Descriptive Statistics

Mean 962.5061
Standard Deviation 186.8352
Variance 34907.3972
Sum 3.9624e+07
Skewness -4.9243
Kurtosis 22.2493
Coefficient of Variation 0.1941
  • pdays is not normally distributed (p-value 4.569894875463409e-25)
  • pdays has 1513 outliers

previous

categorical

Approximate Distinct Count 8
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 2.6 MB
  • The largest value (0) is over 7.8 times larger than the second largest value (1)

Length

Mean 1
Standard Deviation 0
Median 1
Minimum 1
Maximum 1

Sample

1st row 0
2nd row 0
3rd row 0
4th row 0
5th row 0

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 41168
  • The top 2 categories (0, 1) take over 50.0%
  • The largest value (0) is over 7.8 times larger than the second largest value (1)
  • previous has words of constant length

poutcome

categorical

Approximate Distinct Count 3
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 3.0 MB
  • The largest value (nonexistent) is over 8.36 times larger than the second largest value (failure)

Length

Mean 10.4538
Standard Deviation 1.3735
Median 11
Minimum 7
Maximum 11

Sample

1st row nonexistent
2nd row nonexistent
3rd row nonexistent
4th row nonexistent
5th row nonexistent

Letter

Count 430364
Lowercase Letter 430364
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 0
  • The top 2 categories (nonexistent, failure) take over 50.0%
  • The largest value (nonexistent) is over 8.36 times larger than the second largest value (failure)

emp.var.rate

numerical

Approximate Distinct Count 10
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 643.2 KB
Mean 0.08194
Minimum -3.4
Maximum 1.4
Zeros 0
Zeros (%) 0.0%
Negatives 17182
Negatives (%) 41.7%
  • emp.var.rate is skewed left (γ1 = -0.7241)

Quantile Statistics

Minimum -3.4
5-th Percentile -2.9
Q1 -1.8
Median 1.1
Q3 1.4
95-th Percentile 1.4
Maximum 1.4
Range 4.8
IQR 3.2

Descriptive Statistics

Mean 0.08194
Standard Deviation 1.571
Variance 2.4679
Sum 3373.2
Skewness -0.7241
Kurtosis -1.0627
Coefficient of Variation 19.1727
  • emp.var.rate is not normally distributed (p-value 3.4333118326765044e-17)

cons.price.idx

numerical

Approximate Distinct Count 26
Approximate Unique (%) 0.1%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 643.2 KB
Mean 93.5756
Minimum 92.201
Maximum 94.767
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • cons.price.idx is skewed left (γ1 = -0.2311)

Quantile Statistics

Minimum 92.201
5-th Percentile 92.713
Q1 93.075
Median 93.749
Q3 93.994
95-th Percentile 94.465
Maximum 94.767
Range 2.566
IQR 0.919

Descriptive Statistics

Mean 93.5756
Standard Deviation 0.5788
Variance 0.335
Sum 3.8523e+06
Skewness -0.2311
Kurtosis -0.8303
Coefficient of Variation 0.006185
  • cons.price.idx is not normally distributed (p-value 1.1385099954867911e-09)

cons.conf.idx

numerical

Approximate Distinct Count 26
Approximate Unique (%) 0.1%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 643.2 KB
Mean -40.5023
Minimum -50.8
Maximum -26.9
Zeros 0
Zeros (%) 0.0%
Negatives 41168
Negatives (%) 100.0%
  • cons.conf.idx is skewed right (γ1 = 0.3037)

Quantile Statistics

Minimum -50.8
5-th Percentile -47.1
Q1 -42.7
Median -41.8
Q3 -36.4
95-th Percentile -33.6
Maximum -26.9
Range 23.9
IQR 6.3

Descriptive Statistics

Mean -40.5023
Standard Deviation 4.627
Variance 21.4093
Sum -1.6674e+06
Skewness 0.3037
Kurtosis -0.3594
Coefficient of Variation -0.1142
  • cons.conf.idx is not normally distributed (p-value 4.2657500184739915e-15)
  • cons.conf.idx has 446 outliers

euribor3m

numerical

Approximate Distinct Count 316
Approximate Unique (%) 0.8%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 643.2 KB
Mean 3.6214
Minimum 0.634
Maximum 5.045
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • euribor3m is skewed left (γ1 = -0.7093)

Quantile Statistics

Minimum 0.634
5-th Percentile 0.797
Q1 1.344
Median 4.857
Q3 4.961
95-th Percentile 4.966
Maximum 5.045
Range 4.411
IQR 3.617

Descriptive Statistics

Mean 3.6214
Standard Deviation 1.7344
Variance 3.0081
Sum 149086.823
Skewness -0.7093
Kurtosis -1.4065
Coefficient of Variation 0.4789
  • euribor3m is not normally distributed (p-value 7.991950387765408e-18)

nr.employed

numerical

Approximate Distinct Count 11
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 643.2 KB
Mean 5167.0523
Minimum 4963.6
Maximum 5228.1
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • nr.employed is skewed left (γ1 = -1.044)

Quantile Statistics

Minimum 4963.6
5-th Percentile 5017.5
Q1 5099.1
Median 5191
Q3 5228.1
95-th Percentile 5228.1
Maximum 5228.1
Range 264.5
IQR 129

Descriptive Statistics

Mean 5167.0523
Standard Deviation 72.2302
Variance 5217.1967
Sum 2.1272e+08
Skewness -1.044
Kurtosis -0.004691
Coefficient of Variation 0.01398
  • nr.employed is not normally distributed (p-value 1.8705966689027615e-17)

y

categorical

Approximate Distinct Count 2
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 2.6 MB
  • The largest value (no) is over 7.88 times larger than the second largest value (yes)

Length

Mean 2.1126
Standard Deviation 0.3162
Median 2
Minimum 2
Maximum 3

Sample

1st row no
2nd row no
3rd row no
4th row no
5th row no

Letter

Count 86973
Lowercase Letter 86973
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 0
  • The top 2 categories (no, yes) take over 50.0%

age

categorical

Approximate Distinct Count 78
Approximate Unique (%) 0.2%
Missing 0
Missing (%) 0.0%
Memory Size 2.6 MB

Length

Mean 2
Standard Deviation 0
Median 2
Minimum 2
Maximum 2

Sample

1st row 40
2nd row 56
3rd row 45
4th row 59
5th row 41

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 82336
  • age has words of constant length

job

categorical

Approximate Distinct Count 12
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 2.9 MB

Length

Mean 8.9555
Standard Deviation 2.1645
Median 10
Minimum 6
Maximum 13

Sample

1st row admin.
2nd row services
3rd row services
4th row admin.
5th row blue-collar

Letter

Count 347587
Lowercase Letter 347587
Space Separator 0
Uppercase Letter 0
Dash Punctuation 10673
Decimal Number 0

Interactions

Correlations

Missing Values

Report generated with DataPrep

Key findings from 5.1 Summary statistics¶

Key insights:

  • Most individuals are married with a university degree.
  • The Default, Housing, and Loan fields are predominantly 'no'.
  • Cellular contact was predominant, especially in May.
  • Thursday was a slightly more common contact day.
  • Most customers had not participated in a prior marketing campaign.
  • A majority did not subscribe to the Telecom plan.
  • Admin was the most common job category.

Potential problems to solve:

  • Skewed variables: many variables have asymmetric distributions, which is not ideal for a potential modelling phase. The feature 'pdays' is especially skewed.
  • High cardinality: the feature 'age' has high cardinality with 78 unique values.
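One quick way to quantify the skew flagged above is pandas' built-in skew(). A minimal sketch (the function name skew_report and the threshold of 1.0 are illustrative assumptions, not part of the analysis above), assuming df is the DataFrame loaded earlier:

```python
import pandas as pd

def skew_report(df: pd.DataFrame, threshold: float = 1.0) -> pd.Series:
    """Return numeric columns whose absolute skewness exceeds `threshold`,
    largest-magnitude first. Hypothetical helper for a quick skew check."""
    skews = df.select_dtypes(include="number").skew()
    return skews[skews.abs() > threshold].sort_values(key=abs, ascending=False)
```

Applied to this dataset, such a check would surface 'pdays' (skewness -4.92) and 'duration' near the top of the list.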

Actions taken after key findings from 5.1 Summary statistics¶

In [111]:
# Take a quick look at the 'pdays' feature
unique_values = df['pdays'].unique()
print(sorted(unique_values))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 25, 26, 27, 999]
In [112]:
# Transform pdays into contacted_recently
df['contacted_recently'] = df['pdays'].apply(lambda x: 0 if x == 999 else 1)

# Drop the original 'pdays' column
df.drop('pdays', axis=1, inplace=True)

# (So, 2023)
In [113]:
# Take a look at the new feature
df['contacted_recently'].value_counts()
Out[113]:
0    39655
1     1513
Name: contacted_recently, dtype: int64
In [114]:
# Take a quick look at the age feature
unique_values = df['age'].unique()
print(sorted(unique_values))
['17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '91', '92', '94', '95', '98']
In [115]:
# Define function to bin ages
def age_bins(age):
    age = int(age)  # Convert string age to integer
    if 17 <= age <= 30:
        return 'Young'
    elif 31 <= age <= 45:
        return 'Young Adult'
    elif 46 <= age <= 60:
        return 'Middle Age'
    elif 61 <= age <= 75:
        return 'Senior'
    else:
        return 'Elderly'

# Apply the function to the 'age' column
df['age_group'] = df['age'].apply(age_bins)

# Drop the original 'age' column
df.drop('age', axis=1, inplace=True)
# (How to Use if Else Statement in Python to Bin a Variable, n.d.)
In [116]:
# Take a look at the new feature
df['age_group'].value_counts()
Out[116]:
Young Adult    21964
Middle Age     10916
Young           7381
Senior           662
Elderly          245
Name: age_group, dtype: int64
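The if/else binning above can also be expressed with pd.cut, which is vectorised and keeps all the bin edges in one place. A hedged alternative sketch (bin_ages is a hypothetical name; the bin edges mirror the age_bins() function above), assuming the original string-typed 'age' column:

```python
import pandas as pd

def bin_ages(age: pd.Series) -> pd.Series:
    """Vectorised equivalent of the age_bins() function defined above.
    Intervals are right-inclusive: (16, 30] -> Young, (30, 45] -> Young Adult, etc."""
    return pd.cut(
        age.astype(int),
        bins=[16, 30, 45, 60, 75, 200],
        labels=["Young", "Young Adult", "Middle Age", "Senior", "Elderly"],
    )
```

The result is a Categorical column, which also makes the group ordering explicit for later plotting.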

Visualisations for 5.1 Summary statistics¶

In [117]:
# Prepare the data for exploration
# Convert target variable 'y' to binary format for correlation.
df['y_binary'] = df['y'].apply(lambda x: 1 if x == 'yes' else 0)

# Identify the categorical and numerical columns.
categorical_columns = df.select_dtypes(include=['object']).columns.drop('y')
numerical_columns = df.select_dtypes(exclude=['object']).columns.drop('y_binary')

# (So, 2023)
In [118]:
# Import libraries used for visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# Visualisation of the distribution of the numerical features
for col in numerical_columns:
    plt.figure(figsize=(10, 6))
    sns.histplot(df[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.show()

# (So, 2023)
In [119]:
# Visualisation of the distribution of the categorical features
for col in categorical_columns:
    plt.figure(figsize=(10, 6))
    sns.countplot(data=df, x=col, order = df[col].value_counts().index)
    plt.title(f'Count of Categories in {col}')
    plt.xticks(rotation=45)
    plt.show()
    
# (So, 2023)
In [120]:
# Distribution of the newly transformed feature
plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='contacted_recently')
plt.title('Distribution of Contacted Recently')
plt.show()

# (So, 2023)
In [121]:
# Create plots for the target variable
from dataprep.eda import plot
plot(df, 'y')

# (Dong, 2023)
Out[121]:
DataPrep.EDA Report

Overview

Approximate Distinct Count2
Approximate Unique (%)0.0%
Missing0
Missing (%)0.0%
Memory Size2.6 MB

Length

Mean2.1126
Standard Deviation0.3162
Median2
Minimum2
Maximum3

Sample

1st rowno
2nd rowno
3rd rowno
4th rowno
5th rowno

Letter

Count86973
Lowercase Letter86973
Space Separator0
Uppercase Letter0
Dash Punctuation0
Decimal Number0
  • The largest value (no) is over 7.88 times larger than the second largest value (yes)
  • The top 2 categories (no, yes) take over 50.0%
Value  Count  Frequency (%)
no     36531  88.7%
yes    4637   11.3%

Step 5.2 - Identifying Outliers¶

In [122]:
# Take a quick look over the statistical data of the numerical features again
df.describe()
Out[122]:
duration campaign previous emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed contacted_recently y_binary
count 41168.000000 41168.000000 41168.000000 41168.000000 41168.000000 41168.000000 41168.000000 41168.000000 41168.000000 41168.000000
mean 258.311237 2.568087 0.172756 0.081937 93.575563 -40.502261 3.621425 5167.052308 0.036752 0.112636
std 259.325938 2.770530 0.493782 1.570960 0.578761 4.627020 1.734375 72.230165 0.188154 0.316151
min 0.000000 1.000000 0.000000 -3.400000 92.201000 -50.800000 0.634000 4963.600000 0.000000 0.000000
25% 102.000000 1.000000 0.000000 -1.800000 93.075000 -42.700000 1.344000 5099.100000 0.000000 0.000000
50% 180.000000 2.000000 0.000000 1.100000 93.749000 -41.800000 4.857000 5191.000000 0.000000 0.000000
75% 319.000000 3.000000 0.000000 1.400000 93.994000 -36.400000 4.961000 5228.100000 0.000000 0.000000
max 4918.000000 56.000000 7.000000 1.400000 94.767000 -26.900000 5.045000 5228.100000 1.000000 1.000000
In [123]:
import numpy as np # Library used for numerical calculations

# Initialise an empty dictionary to store the results
outliers_dict = {}

for col in numerical_columns:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Get count of outliers
    outliers_count = df[(df[col] < lower_bound) | (df[col] > upper_bound)].shape[0]

    # Store the results
    outliers_dict[col] = {
        "Lower_Bound": lower_bound,
        "Upper_Bound": upper_bound,
        "Outliers_Count": outliers_count
    }

# Convert the dictionary to a dataframe for better presentation
outliers_df = pd.DataFrame(outliers_dict).T

print(outliers_df)
                    Lower_Bound  Upper_Bound  Outliers_Count
duration              -223.5000     644.5000          2963.0
campaign                -2.0000       6.0000          2406.0
previous                 0.0000       0.0000          5621.0
emp.var.rate            -6.6000       6.2000             0.0
cons.price.idx          91.6965      95.3725             0.0
cons.conf.idx          -52.1500     -26.9500           446.0
euribor3m               -4.0815      10.3865             0.0
nr.employed           4905.6000    5421.6000             0.0
contacted_recently       0.0000       0.0000          1513.0

Visualisations for 5.2 - Identifying Outliers¶

In [124]:
# List of numeric columns where we want to check for outliers
plt.figure(figsize=(15,10))

# Plotting boxplots for each column
for i, col in enumerate(numerical_columns, 1):
    plt.subplot(3, 3, i)
    sns.boxplot(y=df[col])
    plt.ylabel(col)
    plt.title(f'Boxplot of {col}')

plt.tight_layout()
plt.show()

Key findings from 5.2 Identifying Outliers¶

  • Duration: 2963 data points are identified as outliers, even though 75% of the calls last less than 319 seconds.
  • Campaign: this feature also has many outliers, 2406 in total. The maximum number of contacts performed during this campaign was 56, vastly more than the 75th percentile of just 3.
  • Previous: most clients were not contacted prior to this campaign, but a few were contacted multiple times.
  • Cons.conf.idx: 446 data points fall below the outlier threshold of -52.15 for the consumer confidence index.

Several features have many outliers, which aligns with the skewed distributions discovered in Step 5.1. There is no need to handle the identified outliers immediately, but it is important to note that skewed data can have a disproportionate influence on results, leading to potential misinterpretations.
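One common, non-destructive way to dampen the influence of such right-skewed features before a modelling phase is a log transform. A minimal sketch (log_transform is a hypothetical helper, not part of the analysis above); np.log1p is used because 'duration' contains zeros, where a plain log would be undefined:

```python
import numpy as np
import pandas as pd

def log_transform(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    """Return a copy of `df` with log1p-transformed versions of the given columns.
    log1p(x) = log(1 + x), which is safe at x == 0."""
    out = df.copy()
    for col in cols:
        out[f"{col}_log"] = np.log1p(out[col])
    return out
```

For example, log_transform(df, ['duration', 'campaign']) would add 'duration_log' and 'campaign_log' columns while leaving the originals intact for comparison.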

Step 5.3 - Relationship between variables¶


In [125]:
# Print the correlation matrix.
# Sort correlations with the binary version of the target variable.
correlation_matrix = df.corr(numeric_only=True)  # restrict to numeric columns (required in pandas >= 2.0)
print(correlation_matrix)

correlations_with_target = correlation_matrix['y_binary'].sort_values(ascending=False)
print(correlations_with_target)

# (So, 2023)
                    duration  campaign  previous  emp.var.rate  \
duration            1.000000 -0.071766  0.020642     -0.027924   
campaign           -0.071766  1.000000 -0.079197      0.150817   
previous            0.020642 -0.079197  1.000000     -0.421214   
emp.var.rate       -0.027924  0.150817 -0.421214      1.000000   
cons.price.idx      0.005262  0.127939 -0.204871      0.775615   
cons.conf.idx      -0.008076 -0.013656 -0.049801      0.196103   
euribor3m          -0.032830  0.135199 -0.454910      0.972266   
nr.employed        -0.044627  0.144150 -0.501279      0.907121   
contacted_recently  0.047540 -0.052541  0.587658     -0.270998   
y_binary            0.405328 -0.066343  0.229972     -0.298251   

                    cons.price.idx  cons.conf.idx  euribor3m  nr.employed  \
duration                  0.005262      -0.008076  -0.032830    -0.044627   
campaign                  0.127939      -0.013656   0.135199     0.144150   
previous                 -0.204871      -0.049801  -0.454910    -0.501279   
emp.var.rate              0.775615       0.196103   0.972266     0.907121   
cons.price.idx            1.000000       0.059583   0.688679     0.522810   
cons.conf.idx             0.059583       1.000000   0.277594     0.100109   
euribor3m                 0.688679       0.277594   1.000000     0.945209   
nr.employed               0.522810       0.100109   0.945209     1.000000   
contacted_recently       -0.079293       0.091918  -0.296826    -0.372440   
y_binary                 -0.136413       0.055124  -0.307641    -0.354532   

                    contacted_recently  y_binary  
duration                      0.047540  0.405328  
campaign                     -0.052541 -0.066343  
previous                      0.587658  0.229972  
emp.var.rate                 -0.270998 -0.298251  
cons.price.idx               -0.079293 -0.136413  
cons.conf.idx                 0.091918  0.055124  
euribor3m                    -0.296826 -0.307641  
nr.employed                  -0.372440 -0.354532  
contacted_recently            1.000000  0.324474  
y_binary                      0.324474  1.000000  
y_binary              1.000000
duration              0.405328
contacted_recently    0.324474
previous              0.229972
cons.conf.idx         0.055124
campaign             -0.066343
cons.price.idx       -0.136413
emp.var.rate         -0.298251
euribor3m            -0.307641
nr.employed          -0.354532
Name: y_binary, dtype: float64

Analysis of Categorical Features¶

In [126]:
# Distribution of target variable across categorical features
for col in categorical_columns:
    print(pd.crosstab(df[col], df['y'], margins=True, margins_name="Total"), "\n")
    
# (So, 2023)
y            no   yes  Total
marital                     
divorced   4134   476   4610
married   22385  2529  24914
single     9944  1620  11564
unknown      68    12     80
Total     36531  4637  41168 

y                       no   yes  Total
education                              
basic.4y              3747   428   4175
basic.6y              2103   188   2291
basic.9y              5571   473   6044
high.school           8479  1031   9510
illiterate              14     4     18
professional.course   4644   594   5238
university.degree    10494  1668  12162
unknown               1479   251   1730
Total                36531  4637  41168 

y           no   yes  Total
default                    
no       28376  4194  32570
unknown   8152   443   8595
yes          3     0      3
Total    36531  4637  41168 

y           no   yes  Total
housing                    
no       16588  2024  18612
unknown    883   107    990
yes      19060  2506  21566
Total    36531  4637  41168 

y           no   yes  Total
loan                       
no       30084  3847  33931
unknown    883   107    990
yes       5564   683   6247
Total    36531  4637  41168 

y             no   yes  Total
contact                      
cellular   22281  3850  26131
telephone  14250   787  15037
Total      36531  4637  41168 

y         no   yes  Total
month                    
apr     2092   539   2631
aug     5521   655   6176
dec       93    89    182
jul     6521   648   7169
jun     4759   559   5318
mar      270   276    546
may    12877   886  13763
nov     3682   414   4096
oct      402   315    717
sep      314   256    570
Total  36531  4637  41168 

y               no   yes  Total
day_of_week                    
fri           6979   846   7825
mon           7661   846   8507
thu           7574  1043   8617
tue           7133   953   8086
wed           7184   949   8133
Total        36531  4637  41168 

y               no   yes  Total
poutcome                       
failure       3645   605   4250
nonexistent  32407  3140  35547
success        479   892   1371
Total        36531  4637  41168 

y                 no   yes  Total
job                              
admin.          9068  1351  10419
blue-collar     8614   638   9252
entrepreneur    1332   124   1456
housemaid        953   106   1059
management      2595   328   2923
retired         1283   433   1716
self-employed   1272   149   1421
services        3642   323   3965
student          600   275    875
technician      6009   729   6738
unemployed       870   144   1014
unknown          293    37    330
Total          36531  4637  41168 

y               no   yes  Total
age_group                      
Elderly        116   129    245
Middle Age    9872  1044  10916
Senior         378   284    662
Young         6257  1124   7381
Young Adult  19908  2056  21964
Total        36531  4637  41168 

In [127]:
# Calculates the average value of the target variable for each categorical feature
for col in categorical_columns:
    print(df.groupby(col)['y_binary'].mean(), "\n")
marital
divorced    0.103254
married     0.101509
single      0.140090
unknown     0.150000
Name: y_binary, dtype: float64 

education
basic.4y               0.102515
basic.6y               0.082060
basic.9y               0.078259
high.school            0.108412
illiterate             0.222222
professional.course    0.113402
university.degree      0.137148
unknown                0.145087
Name: y_binary, dtype: float64 

default
no         0.128769
unknown    0.051542
yes        0.000000
Name: y_binary, dtype: float64 

housing
no         0.108747
unknown    0.108081
yes        0.116201
Name: y_binary, dtype: float64 

loan
no         0.113377
unknown    0.108081
yes        0.109332
Name: y_binary, dtype: float64 

contact
cellular     0.147335
telephone    0.052338
Name: y_binary, dtype: float64 

month
apr    0.204865
aug    0.106056
dec    0.489011
jul    0.090389
jun    0.105115
mar    0.505495
may    0.064375
nov    0.101074
oct    0.439331
sep    0.449123
Name: y_binary, dtype: float64 

day_of_week
fri    0.108115
mon    0.099448
thu    0.121040
tue    0.117858
wed    0.116685
Name: y_binary, dtype: float64 

poutcome
failure        0.142353
nonexistent    0.088334
success        0.650620
Name: y_binary, dtype: float64 

job
admin.           0.129667
blue-collar      0.068958
entrepreneur     0.085165
housemaid        0.100094
management       0.112213
retired          0.252331
self-employed    0.104856
services         0.081463
student          0.314286
technician       0.108192
unemployed       0.142012
unknown          0.112121
Name: y_binary, dtype: float64 

age_group
Elderly        0.526531
Middle Age     0.095639
Senior         0.429003
Young          0.152283
Young Adult    0.093608
Name: y_binary, dtype: float64 

In [128]:
# Chi-squared tests
from scipy.stats import chi2_contingency # Check if there is a significant association between two categorical variables

# (So, 2023)

for col in categorical_columns:
    contingency = pd.crosstab(df[col], df['y'])
    chi2, p, _, _ = chi2_contingency(contingency)
    print(f"Chi-squared test for {col}:")
    print(f"Chi2 Statistic: {chi2}, P-Value: {p}\n")
# (Python: Chi Squared for Categorical Values in Large Dataset, n.d.)
Chi-squared test for marital:
Chi2 Statistic: 123.24250810497784, P-Value: 1.545367250484738e-26

Chi-squared test for education:
Chi2 Statistic: 192.40106144950568, P-Value: 4.659316429321491e-38

Chi-squared test for default:
Chi2 Statistic: 406.1668754932802, P-Value: 6.338458054692336e-89

Chi-squared test for housing:
Chi2 Statistic: 5.764757440253741, P-Value: 0.056001392632042554

Chi-squared test for loan:
Chi2 Statistic: 1.0741024238194457, P-Value: 0.5844691895727654

Chi-squared test for contact:
Chi2 Statistic: 860.8334981810171, P-Value: 3.2089162729521324e-189

Chi-squared test for month:
Chi2 Statistic: 3103.6905347599236, P-Value: 0.0

Chi-squared test for day_of_week:
Chi2 Statistic: 26.033472341587366, P-Value: 3.115661068005872e-05

Chi-squared test for poutcome:
Chi2 Statistic: 4217.65740769466, P-Value: 0.0

Chi-squared test for job:
Chi2 Statistic: 960.0161267578051, P-Value: 7.691135758268049e-199

Chi-squared test for age_group:
Chi2 Statistic: 1310.0366217187566, P-Value: 2.218594897381302e-282
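The chi-squared statistics above grow with sample size, which makes them hard to compare across features; Cramér's V rescales them to [0, 1]. A hedged sketch of the bias-uncorrected version (cramers_v is a hypothetical helper; correction=False disables the Yates continuity correction so that perfect association maps to exactly 1):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Bias-uncorrected Cramér's V between two categorical series
    (0 = no association, 1 = perfect association)."""
    contingency = pd.crosstab(x, y)
    chi2 = chi2_contingency(contingency, correction=False)[0]
    n = contingency.to_numpy().sum()
    min_dim = min(contingency.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))
```

Ranking features by cramers_v(df[col], df['y']) would give a sample-size-independent ordering of the associations tested above.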

Analysis of Numerical Features¶

In [129]:
# Stats for numerical columns grouped by the target variable 
grouped_stats = df.groupby('y')[numerical_columns].agg(['mean', 'std'])
print(grouped_stats)

# (So, 2023)
       duration              campaign            previous            \
           mean         std      mean       std      mean       std   
y                                                                     
no   220.862719  207.131218  2.633571  2.873951  0.132299  0.408760   
yes  553.336424  401.245018  2.052189  1.666640  0.491482  0.856749   

    emp.var.rate           cons.price.idx           cons.conf.idx            \
            mean       std           mean       std          mean       std   
y                                                                             
no      0.248865  1.482935      93.603691  0.558970    -40.593132  4.390251   
yes    -1.233146  1.623964      93.353968  0.676101    -39.786370  6.136721   

    euribor3m            nr.employed            contacted_recently            
         mean       std         mean        std               mean       std  
y                                                                             
no   3.811519  1.638147  5176.175703  64.556246           0.015001  0.121558  
yes  2.123831  1.742947  5095.176795  87.557909           0.208109  0.405999  
In [130]:
# T-test
from scipy.stats import ttest_ind # Check if mean of numerical features differ between two groups

print("T-test results for numerical columns:\n")

for col in numerical_columns:
    group_no = df[df['y'] == 'no'][col]
    group_yes = df[df['y'] == 'yes'][col]
    
    t_stat, p = ttest_ind(group_no, group_yes)
    
    print(f"T-test for {col}:")
    print(f"T-statistic: {t_stat}, P-Value: {p}\n")
    
# (So, 2023)
# (Python P-value From T-statistic, n.d.)
T-test results for numerical columns:

T-test for duration:
T-statistic: -89.95966640395275, P-Value: 0.0

T-test for campaign:
T-statistic: 13.490286751781703, P-Value: 2.1850331835872513e-41

T-test for previous:
T-statistic: -47.94505876847229, P-Value: 0.0

T-test for emp.var.rate:
T-statistic: 63.39870878307583, P-Value: 0.0

T-test for cons.price.idx:
T-statistic: 27.938511733640116, P-Value: 3.545038414151215e-170

T-test for cons.conf.idx:
T-statistic: -11.201308141760773, P-Value: 4.426471223158027e-29

T-test for euribor3m:
T-statistic: 65.59994497937804, P-Value: 0.0

T-test for nr.employed:
T-statistic: 76.92946728921919, P-Value: 0.0

T-test for contacted_recently:
T-statistic: -69.59967481638431, P-Value: 0.0
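The grouped standard deviations above differ considerably between the 'yes' and 'no' groups (e.g. duration: ~207 vs ~401), so Student's equal-variance assumption is doubtful; Welch's t-test drops that assumption via equal_var=False. A minimal sketch of the same comparison with that flag (welch_ttest_by_target is a hypothetical wrapper, not part of the analysis above):

```python
import pandas as pd
from scipy.stats import ttest_ind

def welch_ttest_by_target(df: pd.DataFrame, col: str, target: str = "y") -> tuple[float, float]:
    """Welch's t-test for `col` between target == 'no' and target == 'yes'.
    Returns (t_statistic, p_value)."""
    group_no = df.loc[df[target] == "no", col]
    group_yes = df.loc[df[target] == "yes", col]
    t_stat, p = ttest_ind(group_no, group_yes, equal_var=False)  # Welch's variant
    return float(t_stat), float(p)
```

With sample sizes this large the conclusions are unlikely to flip, but Welch's version is the safer default when group variances differ.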

Visualisation for 5.3 - Relationship between variables¶

In [131]:
# 1. Heatmap of the Correlation Matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title("Correlation Heatmap")
plt.show()

# (So, 2023)
In [132]:
from dataprep.eda import plot_correlation
# Use the plot_correlation function
plot_correlation(df, 'y_binary')

# (So, 2023)
Out[132]:
DataPrep.EDA Report
In [133]:
# Bar plots for the average value of the target variable for each categorical feature
plt.figure(figsize=(15, 10))
for index, col in enumerate(categorical_columns, 1):
    plt.subplot(3, 4, index)
    df.groupby(col)['y_binary'].mean().plot(kind='bar', color='skyblue')
    plt.ylabel('Mean Target Rate')
    plt.title(f"Mean Target Rate by {col}")
    plt.tight_layout()

plt.show()
In [134]:
# Box Plots for numeric features compared to the target variable
numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns.drop('y_binary').tolist() # Numeric columns, excluding the binary target
plt.figure(figsize=(15, 10))
for index, col in enumerate(numeric_columns, 1):
    plt.subplot(3, 4, index)
    sns.boxplot(x="y_binary", y=col, data=df, palette='pastel')
    plt.title(f"Boxplot of {col} vs. Target")
    plt.tight_layout()

plt.show()

Key findings from 5.3 Relationship between variables¶

  1. Categorical Features
  • Month: March and December have notably high success rates for the target variable, while May has the highest contact count but a lower success rate.
  • Previous outcome: clients whose previous campaign outcome was 'success' subscribe at a significantly higher rate.
  • Job type: students and retirees have relatively higher positive response rates than other job categories.
  • Age groups: the Elderly and Senior groups have significantly higher positive response rates than the other age groups.
  2. Numerical Features
  • Duration: has the highest positive correlation (0.405328) with the target variable.
  • Contacted_recently and Previous: moderate positive correlations of 0.324474 and 0.229972 respectively.
  • nr.employed, euribor3m, and emp.var.rate: notable negative correlations of -0.354532, -0.307641, and -0.298251 respectively.